Exploiting Concept Clumping for Efficient Incremental News Article Categorization

نویسندگان

  • Alfred Krzywicki
  • Wayne Wobcke
چکیده

In this paper, we introduce efficient methods for incremental multilabel categorization of documents. We use concept clumping to efficiently categorize news articles into a hierarchical structure of categories. Concept clumping is a phenomenon of local coherences occurring in the data and it has been previously used for fast, incremental e-mail classification. We extend the definition of clumping and introduce additional clumping metrics specifically for multi-label document categorization. We present three methods for incremental multi-label categorization that exploit concept clumping and make use of thresholding techniques and a new term-category weight boosting method. Our methods are tested using the Reuters (RCV1) news corpus and the accuracy obtained is comparable to some well known machine learning methods trained in batch mode, but with much lower computation time.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Exploiting Concept Clumping for Efficient Incremental E-Mail Categorization

We introduce a novel approach to incremental e-mail categorization based on identifying and exploiting “clumps” of messages that are classified similarly. Clumping reflects the local coherence of a classification scheme and is particularly important in a setting where the classification scheme is dynamically changing, such as in e-mail categorization. We propose a number of metrics to quantify ...

متن کامل

Using Concept Relationships to Improve Document Categorization

In the information age we much depend on our ability to find information hidden in mostly unstructured and textual documents. This article proposes a simple method in which (as an addition to existing systems) categorization accuracy can be improved, compared to traditional techniques, without requiring any time-consuming or language-dependent computation. That result is achieved by exploiting ...

متن کامل

TRESTLE: Incremental Learning in Structured Domains using Partial Matching and Categorization

We present TRESTLE, an incremental algorithm for probabilistic concept formation in structured domains that builds on prior concept learning research. TRESTLE works by creating a hierarchical categorization tree that can be used to predict missing attribute values and cluster sets of examples into conceptually meaningful groups. It is able to update its knowledge by partially matching novel str...

متن کامل

Exploiting News to Categorize Tweets: Quantifying the Impact of Different News Collections

Short texts, due to their nature which makes them full of abbreviations and new coined acronyms, are not easy to classify. Text enrichment is emerging in the literature as a potentially useful tool. This paper is a part of a longer term research that aims at understanding the effectiveness of tweet enrichment by means of news, instead of the whole web as a knowledge source. Since the choice of ...

متن کامل

A Hybrid Framework for Building an Efficient Incremental Intrusion Detection System

In this paper, a boosting-based incremental hybrid intrusion detection system is introduced. This system combines incremental misuse detection and incremental anomaly detection. We use boosting ensemble of weak classifiers to implement misuse intrusion detection system. It can identify new classes types of intrusions that do not exist in the training dataset for incremental misuse detection. As...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011